sentence and score label. Read the specifications of the dataset for details. Use the helper functions in the folder of the first lab session (note that they may need modification) or create your own. You can submit your homework following these guidelines: Git Intro & How to hand in your homework. Make sure to commit and push your changes to your repository BEFORE the deadline (Tuesday, Oct. 29th, 11:59 pm).
# necessary for when working with external scripts
%load_ext autoreload
%autoreload 2
# categories
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# obtain the documents containing the categories provided
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, \
shuffle=True, random_state=42)
import pandas as pd
# my functions
import helpers.data_mining_helpers as dmh
# construct dataframe from a list
X = pd.DataFrame.from_records(dmh.format_rows(twenty_train), columns= ['text'])
# add category to the dataframe
X['category'] = twenty_train.target
# add category label also
X['category_name'] = X.category.apply(lambda t: dmh.format_labels(t, twenty_train))
# Get a copy of the original X dataframe for later excercise
X_copy = X.copy()
#Access Index directly as an attribute, query every 10th record, showing first ten
X.text[::10][0:10]
#Access via loc and label slices
X.loc[::10,'text':'category_name'][0:10]
#Access scalar value with iat
X.iat[0,0]
There is an old saying: "The devil is in the details." When working with extremely large data, it is impractical to check records one by one (as we have been doing so far), and we may not even know what kinds of missing values we are facing. Debugging skills sharpen with time spent chasing bugs, so let's focus on a different method for checking missing values and the kinds of missing values you may encounter. As you will see in a minute, detecting them is not as easy as it looks.
Please check the data and the process below, describe what you observe and why it happened.
$Hint$ : why didn't .isnull() flag every value?
import numpy as np
NA_dict = [{ 'id': 'A', 'missing_example': np.nan },
{ 'id': 'B' },
{ 'id': 'C', 'missing_example': 'NaN' },
{ 'id': 'D', 'missing_example': 'None' },
{ 'id': 'E', 'missing_example': None },
{ 'id': 'F', 'missing_example': '' }]
NA_df = pd.DataFrame(NA_dict, columns = ['id','missing_example'])
NA_df
NA_df['missing_example'].isnull()
# Answer here
# isnull(): indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike)
NA_df['missing_example']
isnull() flags None or NaN in array-like objects.
For A, the value is np.nan, which is captured by isnull().
For B, the key is missing entirely, so pandas fills in NaN, which is captured by isnull().
For C and D, the values are the strings "NaN" and "None"; they are not actually missing, so they are not captured by isnull().
For E, None is captured by isnull(), as documented.
For F, the value is the empty string '', which is not missing, so it is not captured by isnull().
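If we want isnull() to also catch the string placeholders, one option is to normalize them to np.nan first. A minimal sketch (the placeholder list and column name mirror the example above, but are otherwise an assumption about your data source's conventions):

```python
import numpy as np
import pandas as pd

# Hypothetical cleanup step: map common placeholder strings to real NaN
# so that isnull() can detect them. Extend the list to match your data.
placeholders = ['NaN', 'None', '']

df = pd.DataFrame({'id': list('ABCDEF'),
                   'missing_example': [np.nan, np.nan, 'NaN', 'None', None, '']})
df['missing_example'] = df['missing_example'].replace(placeholders, np.nan)
df['missing_example'].isnull()  # now flags all six rows as missing
```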
# Duplicate Operations
dummy_duplicate_dict = [{
'text': 'dummy record',
'category': 1,
'category_name': "dummy category"
},
{
'text': 'dummy record',
'category': 1,
'category_name': "dummy category"
}]
X = pd.concat([X, pd.DataFrame(dummy_duplicate_dict)], ignore_index=True) # DataFrame.append was removed in pandas 2.x, so use pd.concat
X.drop_duplicates(keep=False, inplace=True) # inplace applies changes directly on our dataframe
# Sampling Operations
X_sample = X.sample(n=1000) #random state
Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.
# Answer here
X.equals(X_copy)
According to the documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html), equals() compares two Series or DataFrames to check whether they have the same shape and elements. X_copy is a copy of the dataframe X taken before all of the operations above, and "True" means all elements are the same in both objects.
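To double-check without relying on equals() alone, we can replay the same round trip on a toy frame and confirm the shape and index come back unchanged. A sketch (the toy data is illustrative, and pd.concat stands in for the deprecated append; this assumes the dummy rows were the only duplicates):

```python
import pandas as pd

# Toy version of the round trip: append duplicate rows, then drop them
# with keep=False; the frame ends up equal to its earlier copy.
df = pd.DataFrame({"text": ["a", "b"], "category": [0, 1]})
df_copy = df.copy()

dummy = pd.DataFrame({"text": ["dummy", "dummy"], "category": [9, 9]})
df = pd.concat([df, dummy], ignore_index=True)
df = df.drop_duplicates(keep=False)  # removes both dummy copies

df.equals(df_copy)  # True: shape, index, and values all match
```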
We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise. Below we show a snapshot of the type of chart we are looking for.

# Answer Here
import matplotlib.pyplot as plt
%matplotlib inline
# the distribution of the data
distribution_X = X.category_name.value_counts()
distribution_sample = X_sample.category_name.value_counts()
print(distribution_X)
print("")
print(distribution_sample)
print("")
# Get the categories(labels for X axis)
X_label = X.category_name.value_counts().index
print(X_label)
# Plot the figure
plt.figure()
index = np.arange(0,len(X_label)) # for positioning the bars
bar_width = 0.2 # set the width of the bars
# Bar plot for Distribution of X
A = plt.bar(index+0.9,
distribution_X,
bar_width,
alpha=1,
label="X") # Legend
# Bar plot for Distribution of X_sample
B = plt.bar(index+1.1,
distribution_sample,
bar_width,
alpha=1,
label="X_sample") # Legend
plt.ylabel("Num of samples")
plt.xticks(index+1, list(X_label)) # Labels at X axis
plt.title('Side by Side distribution')
plt.legend() # Show legend on the plot
plt.ylim(0,650)
plt.grid(True) # Plot grid, better for visualization
plt.show()
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.text)
analyze = count_vect.build_analyzer()
# we convert from sparse array to normal array
X_counts[0:5, 0:100].toarray()
We said that the 1 at the beginning of the fifth record represents the 00 term. Notice that there is another 1 in the same record. Can you provide code that verifies which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.
# Answer here
import numpy as np
feat_names = count_vect.get_feature_names()[0:100] # names of the first 100 features
idx = np.where(X_counts[4, 0:100].toarray()[0] == 1)[0] # indices of terms with count 1 in the fifth document
print("Indexes of 1s: ", idx)
print("The term is: ", feat_names[idx[1]]) # the second 1 (the first is the 00 term)
# first twenty features only
plot_x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:20]]
# obtain document index
plot_y = ["doc_"+ str(i) for i in list(X.index)[0:20]]
plot_z = X_counts[0:20, 0:20].toarray()
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(9, 7))
ax = sns.heatmap(df_todraw,
cmap="PuRd",
vmin=0, vmax=1, annot=True)
From the chart above, we can see how sparse the term-document matrix is; i.e., there is only one term with a frequency of 1 in this subselection of the matrix. By the way, you may have noticed that we only selected 20 articles and 20 terms to plot the heatmap. As an exercise, you can try to modify the code above to plot the entire term-document matrix or just a sample of it. How would you do this efficiently? Remember there are a lot of words in the vocabulary. Report below what methods you would use to get a nice and useful visualization.
In order to reduce the size of the heatmap, we can randomly sample 50 documents and 50 terms each to plot the heatmap.
from random import sample
# Sample 50 terms by index
all_feat_idx = list(np.arange(0,len(count_vect.get_feature_names())))
sampled_feat_idx = sample(all_feat_idx,50)
# Sample 50 documents by index
sampled_doc_idx = sample(list(X.index),50)
X_counts_sampled = X_counts[:,sampled_feat_idx]
X_counts_sampled = X_counts_sampled[sampled_doc_idx,:]
X_counts_sampled
%matplotlib inline
# Get sampled feature names
feat_names = np.array(count_vect.get_feature_names())[sampled_feat_idx]
# prefix the sampled feature names
plot_x = ["term_"+str(i) for i in feat_names]
# obtain document index
plot_y = ["doc_"+ str(i) for i in list(X.index[sampled_doc_idx])]
plot_z = X_counts_sampled.toarray()
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(15, 15))
ax = sns.heatmap(df_todraw,
cmap="PuRd", annot=True)
We can also randomly sample documents, but filter out terms within a certain range of frequency to show in the heatmap.
# Get terms within a certain frequency range
term_frequencies=np.asarray(X_counts.sum(axis=0))[0]
frequent_terms_idx = np.where(np.multiply(term_frequencies>200,term_frequencies<250))[0] # get term(indices) within frequency range 200~250
z=np.array(count_vect.get_feature_names())
len(z[frequent_terms_idx])
# Sample 50 documents by index
sampled_doc_idx = sample(list(X.index),50)
X_counts_sampled = X_counts[:,frequent_terms_idx]
X_counts_sampled = X_counts_sampled[sampled_doc_idx,:]
X_counts_sampled
%matplotlib inline
# Get sampled feature names
feat_names = np.array(count_vect.get_feature_names())[frequent_terms_idx]
# prefix the filtered feature names
plot_x = ["term_"+str(i) for i in feat_names]
# obtain document index
plot_y = ["doc_"+ str(i) for i in list(X.index[sampled_doc_idx])]
plot_z = X_counts_sampled.toarray()
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(20, 15))
ax = sns.heatmap(df_todraw,
cmap="PuRd", annot=True)
It would take a long time to plot all terms and documents with a seaborn heatmap. If we only want to observe the pattern of the sparse matrix, we can use spy() from matplotlib to visualize the non-zero values. In this case we cannot observe the frequency of a term; we can only tell whether the term appears in a given document.
# Answer here
%matplotlib inline
feat_names_all = count_vect.get_feature_names()
feat_names_all = ["term_"+str(i) for i in feat_names_all]
docs_all = ["doc_"+ str(i) for i in list(X.index)]
plt.figure(figsize=(20, 20)) # specify size
plt.spy(X_counts,markersize=0.2,aspect=10.0) # plot sparse matrix directly
locsX = np.arange(0,X_counts.shape[1],500) # locations where you want to show term
locsY = np.arange(0,len(X.index),100) # locations where you want to show doc
plt.xticks(locsX, np.array(feat_names_all)[locsX.astype(int)],rotation=90) # show terms on x axis
plt.yticks(locsY, np.array(docs_all)[locsY.astype(int)]) # show docs on y axis
plt.show()
Please try to reduce the dimensionality to 3 and plot the result with a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you found.
$Hint$: you can refer to Axes3D in the documentation.
# Answer here
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
X_reduced = PCA(n_components = 3).fit_transform(X_counts.toarray())
print(X_reduced.shape)
# function to plot 3d-plot
def plot_3d(X_reduced, a, b, categories): # a: elevation, b: azimuth
    col = ['coral', 'blue', 'black', 'm']
    # plot
    fig = plt.figure(figsize=(15, 7))
    ax = Axes3D(fig)
    for c, category in zip(col, categories):
        xs = X_reduced[X['category_name'] == category, 0] # x location
        ys = X_reduced[X['category_name'] == category, 1] # y location
        zs = X_reduced[X['category_name'] == category, 2] # z location
        ax.scatter(xs, ys, zs, c=c, marker='o') # scatter plot for each category
    ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
    ax.set_xlabel('\nX Label')
    ax.set_ylabel('\nY Label')
    ax.set_zlabel('\nZ Label')
    ax.view_init(a, b) # set camera position
    plt.show()
    return ax
angles=[[30,120], [30,150], [80,120]] # give different angles
for ang in angles:
    ax = plot_3d(X_reduced, ang[0], ang[1], categories)
term_frequencies=np.asarray(X_counts.sum(axis=0))[0]
If you want a nicer interactive visualization here, I would encourage you to try installing and using plotly to achieve this.
# Answer here
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
fig = go.Figure( # plotly graph object
data=[go.Bar(y=term_frequencies)],
layout_title_text="Term Frequency"
)
fig.update_yaxes(range=[0,400])
fig.update_xaxes(range=[-0.5, 100.5], # show first 100, move plot to see the rest
tickangle=270,
ticktext=count_vect.get_feature_names(),
tickvals=np.arange(0,len(term_frequencies),1),
tickfont=dict(family='serif', color='black', size=10))
fig.show()
The chart above contains the entire vocabulary, which is computationally intensive to both compute and visualize. As an exercise, can you efficiently reduce the number of terms to visualize?
In order to reduce the number of terms to visualize, we can filter out terms with very high or very low counts and plot only the terms within a certain frequency range. For example, I choose to visualize terms with frequency between 200 and 1000, and we can see that there are only 333 terms within this range.
frequent_terms_idx = np.where(np.multiply(term_frequencies>200,term_frequencies<1000))[0] # get term indices within frequency range 200~1000
z=np.array(count_vect.get_feature_names())
len(z[frequent_terms_idx])
# Answer here
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
fig = go.Figure( # plotly graph object
data=[go.Bar(y=term_frequencies[frequent_terms_idx])], # use terms with frequency within certain range
layout_title_text="Term Frequency"
)
fig.update_yaxes(range=[0,1100])
fig.update_xaxes(range=[-0.5, 100.5], # show first 100, move plot to see the rest
tickangle=270,
ticktext=np.array(count_vect.get_feature_names())[frequent_terms_idx], # use terms with frequency within certain range
tickvals=np.arange(0,len(term_frequencies[frequent_terms_idx]),1), # use terms with frequency within certain range
tickfont=dict(family='serif', color='black', size=10))
fig.show()
Additionally, you can attempt to sort the terms on the x-axis by frequency instead of alphabetical order. This way the visualization is more meaningful and you will be able to observe the so-called long tail (get familiar with this term since it will appear a lot in data mining and other statistics courses).
# Answer here
# Sort the term frequency and return its index with argsort, reverse the sequence because argsort gives index starting
# from the smallest element in the array.
X_count_rank_idx = term_frequencies.argsort()[(-len(term_frequencies)):][::-1]
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
fig = go.Figure( # plotly graph object
data=[go.Bar(y=term_frequencies[X_count_rank_idx])],
layout_title_text="Term Frequency"
)
fig.update_yaxes(range=[0,max(term_frequencies)+100])
fig.update_xaxes(range=[-0.5, 100.5], # show first 100, move plot to see the rest
tickangle=270,
ticktext=np.array(count_vect.get_feature_names())[X_count_rank_idx],
tickvals=np.arange(0,len(term_frequencies[X_count_rank_idx]),1),
tickfont=dict(family='serif', color='black', size=10))
fig.show()
Try to generate the binarization using the category_name column instead. Does it work?
Yes it works.
# Ans
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
mlb = preprocessing.LabelBinarizer()
mlb.fit(X.category_name)
mlb.classes_
X['bin_category'] = mlb.transform(X['category_name']).tolist()
X[0:9]
data_dir = "D:/DMLAB/DM19-Lab1/DM19-Lab1-Homework1/sentiment labelled sentences/sentiment labelled sentences/"
f=open(data_dir + 'amazon_cells_labelled.txt', "r")
amazon =f.readlines() # use readlines to read text file line by line into a list
f.close()
f=open(data_dir + 'imdb_labelled.txt', "r", encoding="utf-8") # utf-8 encoding for imbd
imbd =f.readlines()
f.close()
f=open(data_dir + 'yelp_labelled.txt', "r")
yelp =f.readlines()
f.close()
The list contains sentences with sentiment label at the end. Each text file contains 1000 sentences.
amazon[0:10]
print(len(amazon), len(imbd), len(yelp))
import pandas as pd
# the prepare function separates the label from the corresponding sentence
def prepare(data):
    sentence = []
    sentiment = []
    for d in data:
        sentiment.append(d.split()[-1]) # get the label
        sentence.append(d[0:-3]) # [0:-3] excludes the \t0\n or \t1\n at the end
    return sentence, sentiment
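The [0:-3] slice assumes every line ends with exactly \t0\n or \t1\n. As a sketch of a slightly more defensive alternative (not the approach used above), we could split each line on its last tab instead:

```python
def prepare_split(data):
    """Split each 'sentence<TAB>label' line on its last tab."""
    sentences, sentiments = [], []
    for line in data:
        sent, label = line.rstrip('\n').rsplit('\t', 1)
        sentences.append(sent)
        sentiments.append(label)
    return sentences, sentiments

# Example lines in the same format as the labelled-sentences files
demo = ["So there is no way for me to plug it in here in the US.\t0\n",
        "Good case, Excellent value.\t1\n"]
s, y = prepare_split(demo)  # y comes out as ['0', '1']
```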
# one dataframe for each dataset
df_ama = pd.DataFrame()
df_im = pd.DataFrame()
df_yelp = pd.DataFrame()
# transform text data into dataframe
df_ama['sentence'], df_ama['sentiment'] = prepare(amazon)
df_im['sentence'], df_im['sentiment'] = prepare(imbd)
df_yelp['sentence'], df_yelp['sentiment'] = prepare(yelp)
Show part of the dataframe
df_ama[0:10]
import helpers.data_mining_helpers as dmh
df_ama.isnull().apply(lambda x: dmh.check_missing_values(x))
df_im.isnull().apply(lambda x: dmh.check_missing_values(x))
df_yelp.isnull().apply(lambda x: dmh.check_missing_values(x))
Fortunately, there are no missing sentences or labels. However, I have noticed that there are incomplete sentences within the dataset.
import numpy as np
# Check if there are duplicated values and return the indices where there are duplicates with numpy where function
np.where(df_ama.duplicated('sentence')==True)[0]
np.where(df_im.duplicated('sentence')==True)[0]
np.where(df_yelp.duplicated('sentence')==True)[0]
We can see that there are several duplicates in all three datasets, so we will drop the duplicates except for the first occurrence.
df_ama.drop_duplicates('sentence',keep='first', inplace=True)
# Confirm that there are no duplicates after dropping
print(np.where(df_ama.duplicated('sentence')==True)[0])
print(df_ama.shape)
df_im.drop_duplicates('sentence',keep='first', inplace=True)
print(np.where(df_im.duplicated('sentence')==True)[0])
print(df_im.shape)
df_yelp.drop_duplicates('sentence',keep='first', inplace=True)
print(np.where(df_yelp.duplicated('sentence')==True)[0])
print(df_yelp.shape)
By checking the shape and re-checking for duplicates, we can see that the duplicates have been removed.
df_ama_sample = df_ama.sample(n=300) #random state
df_im_sample = df_im.sample(n=300) #random state
df_yelp_sample = df_yelp.sample(n=300) #random state
According to the data description, there are exactly 500 positive and 500 negative sentences in each dataset. Even after removing duplicates, the ratio should still be close to 1:1. Check whether the sampled data keeps a ratio close to 1:1.
print(df_ama_sample.sentiment.value_counts())
print(df_im_sample.sentiment.value_counts())
print(df_yelp_sample.sentiment.value_counts())
We can see that the distribution of the sentiments remains roughly the same for all three datasets.
Before tokenizing and vectorizing, first clean the data: remove punctuation and unwanted symbols with regular expressions and replace methods.
import re
# define a function that cleans the data
# input : a = list(df_ama['sentence'])
def clean_data(a):
    for i in range(len(a)):
        a[i] = a[i].replace("-", " ") # str.replace returns a new string, so assign it back
        a[i] = a[i].replace("_", " ")
        a[i] = re.sub(r'[^\w\s\r]', ' ', a[i])
    return a
df_ama.sentence = clean_data(list(df_ama.sentence))
df_im.sentence = clean_data(list(df_im.sentence))
df_yelp.sentence = clean_data(list(df_yelp.sentence))
print(df_ama.sentence[0:5])
Import stopwords and porter stemmer from nltk. Define a tokenizer with word stemming.
from nltk.stem.porter import PorterStemmer
import nltk
from nltk.corpus import stopwords
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
nltk.download('stopwords')
stop = stopwords.words('english')
stop = [tokenizer_porter(item)[0] for item in stop] # stem the stopwords as well, so they match the stemmed tokens
stop.extend(['becau'])
stop[0:10]
Define a count vectorizer including stopword removal and tokenizer. Fit and transform all three datasets.
from sklearn.feature_extraction.text import CountVectorizer
count_vect_ama = CountVectorizer(stop_words=stop,tokenizer=tokenizer_porter)
ama_counts = count_vect_ama.fit_transform(df_ama.sentence)
print(ama_counts.shape) # check the shape of this matrix
print(count_vect_ama.get_feature_names()[0:10]) # check the first few feature names
count_vect_im = CountVectorizer(stop_words=stop,tokenizer=tokenizer_porter)
im_counts = count_vect_im.fit_transform(df_im.sentence)
print(im_counts.shape)
print(count_vect_im.get_feature_names()[0:10])
count_vect_yelp = CountVectorizer(stop_words=stop,tokenizer=tokenizer_porter)
yelp_counts = count_vect_yelp.fit_transform(df_yelp.sentence)
print(yelp_counts.shape)
print(count_vect_yelp.get_feature_names()[0:10])
from sklearn.decomposition import PCA
# reduce all datasets to 2 dimensions and plot their values
pca1 = PCA(n_components = 2)
ama_reduced = pca1.fit_transform(ama_counts.toarray())
print(ama_reduced.shape)
pca2 = PCA(n_components = 2)
im_reduced = pca2.fit_transform(im_counts.toarray())
print(im_reduced.shape)
pca3 = PCA(n_components = 2)
yelp_reduced = pca3.fit_transform(yelp_counts.toarray())
print(yelp_reduced.shape)
import matplotlib.pyplot as plt
%matplotlib inline
col = ['coral', 'blue']
categories = ['0','1']
# define a function that plots pca values
def plot_2d(col, categories, X_reduced, X):
    fig = plt.figure(figsize=(25, 10))
    ax = fig.subplots()
    for c, category in zip(col, categories):
        xs = X_reduced[X['sentiment'] == category].T[0]
        ys = X_reduced[X['sentiment'] == category].T[1]
        ax.scatter(xs, ys, c=c, marker='o')
    ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
    ax.set_xlabel('\nX Label')
    ax.set_ylabel('\nY Label')
    plt.show()
plot_2d(col, categories, ama_reduced, df_ama)
plot_2d(col, categories, im_reduced, df_im)
plot_2d(col, categories, yelp_reduced, df_yelp)
Data points with positive/negative sentiment do not seem to occupy different areas of the plot, so I proceed to check the explained variance ratio for each dataset.
print(pca1.explained_variance_ratio_)
print(pca2.explained_variance_ratio_)
print(pca3.explained_variance_ratio_)
We can see that the explained variance ratios for all three reduced datasets are very small, meaning that two components cannot represent the complete datasets very well, so it might not be a good idea to train a classifier using only two PCA components.
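One way to quantify this is to fit a full PCA and count how many components are needed to reach, say, 90% cumulative explained variance. A sketch on synthetic data (in the notebook you would pass the document-term array, e.g. ama_counts.toarray(), instead):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a document-term array
rng = np.random.RandomState(42)
data = rng.rand(200, 50)

pca_full = PCA().fit(data)  # keep all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
# first component count whose cumulative variance reaches 90%
n_components_90 = int(np.searchsorted(cumvar, 0.90)) + 1
print(n_components_90, "components retain 90% of the variance")
```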
ama_term_frequencies = np.asarray(ama_counts.sum(axis=0))[0]
im_term_frequencies = np.asarray(im_counts.sum(axis=0))[0]
yelp_term_frequencies = np.asarray(yelp_counts.sum(axis=0))[0]
print(ama_term_frequencies[0:10])
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
# define a function that plots the frequencies sorted by frequency (same code as in the lab's take-home exercise)
def plot_sorted_freq_curve(dataset, term_frequencies, feature_names):
    # Sort the term frequencies and return the indices with argsort; reverse the sequence
    # because argsort orders from the smallest element in the array.
    X_count_rank_idx = term_frequencies.argsort()[(-len(term_frequencies)):][::-1]
    fig = go.Figure( # plotly graph object
        data=[go.Bar(y=term_frequencies[X_count_rank_idx])],
        layout_title_text="Term Frequency " + dataset
    )
    fig.update_yaxes(range=[0, max(term_frequencies)+100])
    fig.update_xaxes(range=[-0.5, 100.5], # show first 100, pan the plot to see the rest
                     tickangle=270,
                     ticktext=np.array(feature_names)[X_count_rank_idx],
                     tickvals=np.arange(0, len(term_frequencies[X_count_rank_idx]), 1),
                     tickfont=dict(family='serif', color='black', size=10))
    fig.show()
    print('Top 20 words: ', np.array(feature_names)[X_count_rank_idx][0:20]) # print the top 20 words
plot_sorted_freq_curve('Amazon',ama_term_frequencies, count_vect_ama.get_feature_names())
plot_sorted_freq_curve('IMBD',im_term_frequencies, count_vect_im.get_feature_names())
plot_sorted_freq_curve('Yelp',yelp_term_frequencies, count_vect_yelp.get_feature_names())
The original sentiment label is only 0 and 1; it is already discretized and binarized.
In this section, I try to find out which words are frequently used in positive/negative sentiment sentences.
First, plot the word frequency of positive sentences sorted by frequency and print the top 20 words.
# Calculate term frequencies for positive sentiment sentences
ama_term_frequencies_p = np.asarray(ama_counts[np.where(df_ama.sentiment=='1')[0]].sum(axis=0))[0]
im_term_frequencies_p = np.asarray(im_counts[np.where(df_im.sentiment=='1')[0]].sum(axis=0))[0]
yelp_term_frequencies_p = np.asarray(yelp_counts[np.where(df_yelp.sentiment=='1')[0]].sum(axis=0))[0]
ama_term_frequencies_p
# Calculate term frequencies for negative sentiment sentences
ama_term_frequencies_n = np.asarray(ama_counts[np.where(df_ama.sentiment=='0')[0]].sum(axis=0))[0]
im_term_frequencies_n = np.asarray(im_counts[np.where(df_im.sentiment=='0')[0]].sum(axis=0))[0]
yelp_term_frequencies_n = np.asarray(yelp_counts[np.where(df_yelp.sentiment=='0')[0]].sum(axis=0))[0]
ama_term_frequencies_n
plot_sorted_freq_curve('Amazon Positive',ama_term_frequencies_p, count_vect_ama.get_feature_names())
plot_sorted_freq_curve('IMBD Positive',im_term_frequencies_p, count_vect_im.get_feature_names())
plot_sorted_freq_curve('Yelp Positive',yelp_term_frequencies_p, count_vect_yelp.get_feature_names())
Second, plot the word frequency of negative sentences sorted by frequency and print the top 20 words.
plot_sorted_freq_curve('Amazon Negative',ama_term_frequencies_n, count_vect_ama.get_feature_names())
plot_sorted_freq_curve('IMBD Negative',im_term_frequencies_n, count_vect_im.get_feature_names())
plot_sorted_freq_curve('Yelp Negative',yelp_term_frequencies_n, count_vect_yelp.get_feature_names())
From aggregating positive and negative sentences separately, we can see that positive sentences generally contain "good" and "great", while negative sentences contain negative words and the aspects customers or audiences are unhappy about. For example, from "time" and "minute" we can infer that many customers are unhappy because the restaurant took too long to serve their meals; from "phone", "battery", and "charge" we can infer that a customer thinks the phone bought on Amazon has a short battery life and needs recharging too soon.
Generate word clouds to see the most common words appearing in positive/negative sentiment sentences in all three cases (the input for the word clouds was not stemmed).
import re
from wordcloud import WordCloud
# get the word cloud function from helpers
def plot_word_cloud(text, title):
    """ Generate word cloud given some input text doc """
    word_cloud = WordCloud().generate(text)
    plt.figure(figsize=(8, 6), dpi=300)
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title)
    plt.show()
# convert all words to lower case
def to_lower(df):
    # use iloc: np.where returns positional indices, and the index has gaps after dropping duplicates
    all_positive = list(df.sentence.iloc[np.where(df.sentiment == '1')[0]])
    print(len(all_positive))
    all_negative = list(df.sentence.iloc[np.where(df.sentiment == '0')[0]])
    print(len(all_negative))
    for i in range(len(all_positive)):
        all_positive[i] = str(all_positive[i]).lower()
    for i in range(len(all_negative)):
        all_negative[i] = str(all_negative[i]).lower()
    return all_positive, all_negative
all_p_ama, all_n_ama = to_lower(df_ama)
all_p_im, all_n_im = to_lower(df_im)
all_p_yelp, all_n_yelp = to_lower(df_yelp)
plot_word_cloud(' '.join(all_p_ama), 'Amazon Positive')
plot_word_cloud(' '.join(all_n_ama), 'Amazon Negative')
plot_word_cloud(' '.join(all_p_im), 'IMBD Positive')
plot_word_cloud(' '.join(all_n_im), 'IMBD Negative')
plot_word_cloud(' '.join(all_p_yelp), 'Yelp Positive')
plot_word_cloud(' '.join(all_n_yelp), 'Yelp Negative')
Surprisingly, in the word cloud for the Amazon dataset's negative sentences we can see "good" and "great", which differs from the result in the data exploration. It might be that the word clouds are generated differently (their input was not stemmed); it could also be that some negative sentences are sarcastic, which cannot be identified by simply counting word frequencies.
According to the provided reference article, Multinomial naive Bayes is useful for modeling feature vectors where each value represents the number of occurrences of a term or its relative frequency, so we import MultinomialNB from sklearn.
We should first split the data into train/test sets before applying tf-idf and classifying.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
# This function prints the confusion matrix of the predicted result
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix')
    print(cm)
# This function takes a dataframe as input, splits the data into train/test,
# constructs a tf-idf document-term matrix, then fits the model and predicts on the test data
def train_tfidf(X, y, data):
    class_names = [0, 1]
    # Split the dataset into train/test; keep the positive/negative ratio the same
    # in both splits by setting the stratify parameter
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    # Construct a tf-idf vectorizer including stop words and word stemming
    tfidf_vect = TfidfVectorizer(stop_words=stop, tokenizer=tokenizer_porter)
    # Fit and transform the training data
    tfidf_train = tfidf_vect.fit_transform(X_train)
    # Transform the testing data
    tfidf_test = tfidf_vect.transform(X_test)
    # Fit the model with different smoothing parameters
    alphas = [0.001, 0.01, 0.1, 1]
    print('Parameters and results for ' + str(data))
    for a in alphas:
        clf = MultinomialNB(alpha=a)
        clf.fit(tfidf_train.toarray(), y_train)
        y_predict = clf.predict(tfidf_test.toarray())
        # Calculate the accuracy and UAR for each parameter, and print the confusion matrix
        uar = recall_score(y_test, y_predict, average='macro')
        cm = confusion_matrix(y_test, y_predict)
        print('alpha = ' + str(a) + ' / accuracy: ' + str(clf.score(tfidf_test.toarray(), y_test)) + ' / UAR: ' + str(uar))
        plot_confusion_matrix(cm, class_names)
        print("")
    print("___________________________________________________________")
# train and predict
train_tfidf(df_ama.sentence, df_ama.sentiment.astype(int), 'Amazon')
train_tfidf(df_im.sentence, df_im.sentiment.astype(int), 'IMBD')
train_tfidf(df_yelp.sentence, df_yelp.sentiment.astype(int), 'Yelp')
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
# This function takes a dataframe as input, splits the data into train/test,
# constructs a word-frequency document-term matrix, then fits the model and predicts on the test data
def train_count(X, y, data):
    class_names = [0, 1]
    # Split the dataset into train/test; keep the positive/negative ratio the same
    # in both splits by setting the stratify parameter
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    # Construct a count vectorizer including stop words and word stemming
    count_vect = CountVectorizer(stop_words=stop, tokenizer=tokenizer_porter)
    # Fit and transform the training data
    count_train = count_vect.fit_transform(X_train)
    # Transform the testing data
    count_test = count_vect.transform(X_test)
    # Fit the model with different smoothing parameters
    alphas = [0.001, 0.01, 0.1, 1]
    print('Parameters and results for ' + str(data))
    for a in alphas:
        clf = MultinomialNB(alpha=a)
        clf.fit(count_train.toarray(), y_train)
        y_predict = clf.predict(count_test.toarray())
        # Calculate the accuracy and UAR for each parameter, and print the confusion matrix
        uar = recall_score(y_test, y_predict, average='macro')
        cm = confusion_matrix(y_test, y_predict)
        print('alpha = ' + str(a) + ' / accuracy: ' + str(clf.score(count_test.toarray(), y_test)) + ' / UAR: ' + str(uar))
        plot_confusion_matrix(cm, class_names)
        print("")
    print("___________________________________________________________")
# train and predict
train_count(df_ama.sentence, df_ama.sentiment.astype(int), 'Amazon')
train_count(df_im.sentence, df_im.sentiment.astype(int), 'IMBD')
train_count(df_yelp.sentence, df_yelp.sentiment.astype(int), 'Yelp')
We can see from the results that, with both word-frequency features and tf-idf features, accuracy and UAR increase a little as the smoothing parameter increases. Both feature types give similar results in terms of accuracy and UAR.
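Rather than comparing alphas on the test set, a cleaner variant would tune alpha with cross-validation on the training split only. A sketch with a toy corpus (the data and parameter grid are illustrative, not the lab's required setup):

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for df_ama.sentence / df_ama.sentiment
texts = ["great phone", "terrible battery", "excellent sound quality",
         "awful service", "good value", "bad screen",
         "love this case", "hate the charger"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Vectorizer and classifier in one pipeline, so tf-idf is refit per fold
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("nb", MultinomialNB())])
grid = GridSearchCV(pipe, {"nb__alpha": [0.001, 0.01, 0.1, 1]}, cv=2)
grid.fit(texts, labels)
print("best alpha:", grid.best_params_["nb__alpha"])
```

Putting the vectorizer inside the pipeline matters: it keeps each validation fold unseen by the tf-idf fit, which mirrors how the test set is handled above.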
In the lab part, stop words and punctuation were not removed from the data when creating the document-term matrix, and words were not stemmed. We can download stop words from nltk, import the PorterStemmer, and include both in the count vectorizer.
In the dimensionality reduction part, data points from different categories do not appear separated, so it might not be a good idea to train a classification model on so few PCA components.